Introduction

This assignment applies feature engineering and evaluation methods to a practical scenario: predicting house prices.

Data Reading and Preparation

The dataset is offered in two separate files, one for the training set and another for the test set.

original_training_data = read.csv(file = file.path("dataset/house_price_train.csv"))
original_test_data = read.csv(file = file.path("dataset/house_price_test.csv"))

To avoid applying the feature engineering process twice (once for training and once for test), I will join both datasets, apply the transformations, and then split them again. However, if we try to join the two dataframes as they are, we get an error because they do not have the same columns: the test data has no price column. Therefore, I first create this column in the test set and then join the data.

original_test_data$price <- 0
dataset <- as.data.table(rbind(original_training_data, original_test_data))

Let’s now summarize the dataset to see where to begin.

##        id                   date           price            bedrooms     
##  Min.   :1.000e+06   6/23/2014:  142   Min.   :      0   Min.   : 1.000  
##  1st Qu.:2.123e+09   6/25/2014:  131   1st Qu.: 220000   1st Qu.: 3.000  
##  Median :3.905e+09   6/26/2014:  131   Median : 384000   Median : 3.000  
##  Mean   :4.580e+09   7/8/2014 :  127   Mean   : 431877   Mean   : 3.373  
##  3rd Qu.:7.309e+09   4/27/2015:  126   3rd Qu.: 580000   3rd Qu.: 4.000  
##  Max.   :9.900e+09   3/25/2015:  123   Max.   :7700000   Max.   :33.000  
##                      (Other)  :20817                                     
##    bathrooms      sqft_living       sqft_lot           floors     
##  Min.   :0.500   Min.   :  370   Min.   :    520   Min.   :1.000  
##  1st Qu.:1.750   1st Qu.: 1430   1st Qu.:   5040   1st Qu.:1.000  
##  Median :2.250   Median : 1910   Median :   7618   Median :1.500  
##  Mean   :2.116   Mean   : 2080   Mean   :  15099   Mean   :1.494  
##  3rd Qu.:2.500   3rd Qu.: 2550   3rd Qu.:  10685   3rd Qu.:2.000  
##  Max.   :8.000   Max.   :13540   Max.   :1651359   Max.   :3.500  
##                                                                   
##    waterfront            view          condition        grade       
##  Min.   :0.000000   Min.   :0.0000   Min.   :1.00   Min.   : 3.000  
##  1st Qu.:0.000000   1st Qu.:0.0000   1st Qu.:3.00   1st Qu.: 7.000  
##  Median :0.000000   Median :0.0000   Median :3.00   Median : 7.000  
##  Mean   :0.007547   Mean   :0.2343   Mean   :3.41   Mean   : 7.658  
##  3rd Qu.:0.000000   3rd Qu.:0.0000   3rd Qu.:4.00   3rd Qu.: 8.000  
##  Max.   :1.000000   Max.   :4.0000   Max.   :5.00   Max.   :13.000  
##                                                                     
##    sqft_above   sqft_basement       yr_built     yr_renovated    
##  Min.   : 370   Min.   :   0.0   Min.   :1900   Min.   :   0.00  
##  1st Qu.:1190   1st Qu.:   0.0   1st Qu.:1951   1st Qu.:   0.00  
##  Median :1560   Median :   0.0   Median :1975   Median :   0.00  
##  Mean   :1789   Mean   : 291.7   Mean   :1971   Mean   :  84.46  
##  3rd Qu.:2210   3rd Qu.: 560.0   3rd Qu.:1997   3rd Qu.:   0.00  
##  Max.   :9410   Max.   :4820.0   Max.   :2015   Max.   :2015.00  
##                                                                  
##     zipcode           lat             long        sqft_living15 
##  Min.   :98001   Min.   :47.16   Min.   :-122.5   Min.   : 399  
##  1st Qu.:98033   1st Qu.:47.47   1st Qu.:-122.3   1st Qu.:1490  
##  Median :98065   Median :47.57   Median :-122.2   Median :1840  
##  Mean   :98078   Mean   :47.56   Mean   :-122.2   Mean   :1987  
##  3rd Qu.:98118   3rd Qu.:47.68   3rd Qu.:-122.1   3rd Qu.:2360  
##  Max.   :98199   Max.   :47.78   Max.   :-121.3   Max.   :6210  
##                                                                 
##    sqft_lot15    
##  Min.   :   651  
##  1st Qu.:  5100  
##  Median :  7620  
##  Mean   : 12758  
##  3rd Qu.: 10083  
##  Max.   :871200  
## 

Data Cleaning

In this section I am going to perform some data cleaning.

The features id and date will not offer any predictive advantage, so I drop them.

Factorize features

Going back to the summary of the dataset, we can identify some numerical features that are actually categorical: we have to convert them to the proper class.

dataset$condition <- factor(dataset$condition)
dataset$grade <- factor(dataset$grade)
dataset$zipcode <- factor(dataset$zipcode)
dataset$yr_built <- factor(dataset$yr_built)
dataset$yr_renovated <- factor(dataset$yr_renovated)
dataset$view <- factor(dataset$view)

# let's now turn character columns into factors
dataset[ ,names(dataset)[sapply(dataset, is.character)]:=lapply(.SD,as.factor),
           .SDcols = names(dataset)[sapply(dataset, is.character)]]

Hunting NAs

Let’s check if there are NAs in the dataset

# Counting columns with missing values.
na.cols <- which(colSums(is.na(dataset)) > 0)
paste('There are', length(na.cols), 'columns with missing values')
## [1] "There are 0 columns with missing values"

Outliers

We now focus on the numerical features. In this section I detect the numerical features that present outliers and clip those values; the names of the affected columns are collected in the vector col_outliers_to_remove.

# clipping the outliers
for (i in col_outliers_to_remove){
  qnt <- quantile(dataset[[i]], probs=c(.25, .75))
  caps <- quantile(dataset[[i]], probs=c(.05, .95))
  H <- 1.5 * IQR(dataset[[i]])
  dataset[[i]][dataset[[i]] < (qnt[1] - H)] <- caps[1]
  dataset[[i]][dataset[[i]] > (qnt[2] + H)] <- caps[2]
}

# Plot the box plots again to verify that the procedure has been successful
for (i in col_outliers_to_remove){
  boxplot(dataset[,i, with=FALSE], type="p", xlab=i)
}
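As a quick sanity check of the clipping rule, here is a minimal base-R sketch on a toy vector (the vector x is illustrative, not a dataset column):

```r
# toy vector with one extreme high value
x <- c(10, 12, 11, 13, 12, 11, 14, 13, 12, 200)

qnt  <- quantile(x, probs = c(.25, .75))  # quartiles used for the Tukey fences
caps <- quantile(x, probs = c(.05, .95))  # replacement values for clipped points
H    <- 1.5 * IQR(x)

x[x < (qnt[1] - H)] <- caps[1]  # clip low outliers to the 5th percentile
x[x > (qnt[2] + H)] <- caps[2]  # clip high outliers to the 95th percentile

max(x)  # the extreme value 200 has been capped at the 95th percentile
```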

Skewness

We now need to detect skewness in the target variable. Skewed targets hurt linear models, so to get rid of the skewness I take the log of the variable's values, which flattens its distribution (plotting the variable with ggplot before and after makes the effect visible).

We therefore transform the target variable by applying the log.

The same skewness observed in the target variable also affects other features. To facilitate the application of the regression models, we eliminate this skewness too: for each numeric feature with excessive skewness, we perform a log transformation.

Now, let’s select each feature that is neither a ‘factor’ nor a ‘character’. NB: I will not include lat and long.

column_types <- sapply(names(dataset), function(x) {
    class(dataset[[x]])
  }
)
numeric_columns <- names(column_types[column_types != "factor"])
numeric_columns <- numeric_columns[numeric_columns != 'lat' & numeric_columns != 'long']

At this point, we need to calculate the skewness of each numerical column.

# skew of each variable
skew <- sapply(numeric_columns, function(x) { 
    e1071::skewness(dataset[[x]], na.rm = T)
  }
)

To reduce the skewness across the dataset, we apply the log to the columns whose absolute skewness exceeds a given threshold, which we set to 1.

# transform all variables above the skewness threshold
skewness_threshold <- 1
skew <- skew[abs(skew) > skewness_threshold]
for(x in names(skew)) {
  dataset[[x]] <- log(dataset[[x]] + 1)
}
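To see the effect of the log transform, here is a small sketch using a hand-rolled moment-based skewness (close to, though not identical to, the default estimator of e1071::skewness) on toy right-skewed data:

```r
# moment coefficient of skewness: third central moment over sd^3
skew_fun <- function(v) {
  m <- mean(v)
  mean((v - m)^3) / (mean((v - m)^2))^(3/2)
}

set.seed(1)
x <- rexp(10000)      # heavily right-skewed toy data (true skewness = 2)
skew_fun(x)           # well above the threshold of 1
skew_fun(log(x + 1))  # far closer to 0 after the log transform
```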

Feature Creation

In this section I will create some new features to improve the predictive power of the dataset.

# having a look at the columns
colnames(dataset)
##  [1] "price"         "bedrooms"      "bathrooms"     "sqft_living"  
##  [5] "sqft_lot"      "floors"        "waterfront"    "view"         
##  [9] "condition"     "grade"         "sqft_above"    "sqft_basement"
## [13] "yr_built"      "yr_renovated"  "zipcode"       "lat"          
## [17] "long"          "sqft_living15" "sqft_lot15"
# Creating a total square feet feature
dataset$TotalSqFeet <- as.numeric(dataset$sqft_basement + dataset$sqft_above + dataset$sqft_lot + dataset$sqft_living)

# Creating a remodeling indicator (0 = no remodeling, 1 = remodeling)
dataset$Remod <- ifelse(dataset$yr_renovated != 0, 1, 0)
dataset$Remod <- factor(dataset$Remod) 

# Creating a new feature for the total number of rooms in the house
dataset$rooms <- dataset$bedrooms + dataset$bathrooms

# Creating a new feature with the distance from the reference point (0, 0)
dataset$distance <- distGeo(as.matrix(dataset[,c('long','lat')]), c(0,0))
# We will create a label that aggregates into "others" those grades with less than 1% of share
niche_grade <- names(which(summary(dataset$grade)/nrow(dataset) < 0.01))
dataset[, grade_agg:=as.factor(ifelse(grade %in% niche_grade, 'others', as.character(grade)))]

summary(dataset$grade)/nrow(dataset)
summary(dataset$grade_agg)/nrow(dataset)
sum(summary(dataset$grade_agg)/nrow(dataset)) 

dataset[, length(levels(grade_agg))]
dataset[, length(levels(grade))]
dataset[, length(levels(grade_agg))/length(levels(grade))-1] # important reduction in factor categories
dataset[, grade_agg:=factor(grade_agg, levels=names(sort(summary(dataset$grade_agg), dec=T)))]
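The rare-level aggregation above can be checked on a toy factor (the level names and shares here are made up):

```r
set.seed(42)
g <- factor(sample(c("7", "8", "9", "3", "13"), size = 1000,
                   replace = TRUE, prob = c(.45, .35, .17, .02, .01)))

rare  <- names(which(table(g) / length(g) < 0.05))  # levels below a 5% share
g_agg <- factor(ifelse(g %in% rare, "others", as.character(g)))

levels(g_agg)  # the rare levels collapse into a single "others" level
```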

# we drop the former grade variable
dataset <- dataset[, grade:=NULL]
# let's look at the median price per zipcode and bin the zipcodes into 3 price categories
summary(dataset[price!= 0, price])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.26   12.68   13.02   13.05   13.38   15.86
plot(dataset[price != 0 ,list(price=median(price)), by = zipcode], type="p")

# Defining the 3 bins for zipcode
zip_price <- dataset[price != 0 ,list(price=median(price)), by = zipcode]
cheep_zip <- zip_price[price < 12.5, zipcode]
medium_zip <- zip_price[price >= 12.5 & price < 13.5, zipcode]
exp_zip <- zip_price[price >= 13.5, zipcode]

# Applying the bins
dataset$NeighBin[dataset$zipcode %in% exp_zip] <- 3
dataset$NeighBin[dataset$zipcode %in% medium_zip] <- 2
dataset$NeighBin[dataset$zipcode %in% cheep_zip] <- 1

# Factorizing the new feature
dataset$NeighBin <- factor(dataset$NeighBin)
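The three membership tests above can equivalently be written as a single cut() call; a sketch on toy per-zipcode medians (log scale), using the same breakpoints 12.5 and 13.5:

```r
med_price <- c(12.1, 12.9, 13.7, 13.2, 14.0)  # toy per-zipcode median log-prices

# right = FALSE reproduces the "< 12.5", ">= 12.5 & < 13.5", ">= 13.5" logic
bins <- cut(med_price, breaks = c(-Inf, 12.5, 13.5, Inf),
            labels = c(1, 2, 3), right = FALSE)
bins  # 1 = cheap, 2 = medium, 3 = expensive
```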

Modelling

Train and Validation Splitting

To facilitate the data cleaning and feature engineering, we merged the train and test datasets. We now split them again to build our final model.

training_data <- dataset[which(dataset$price!=0),]
test <- dataset[which(dataset$price==0),]

We split the annotated dataset into training and validation sets for the later evaluation of our regression models.

splits <- splitdf(training_data)
training <- splits$trainset
validation <- splits$testset
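splitdf is a helper defined outside this report; a minimal sketch of what such a function typically does (the 80/20 proportion and the seed are my assumptions), demonstrated on mtcars:

```r
splitdf_sketch <- function(df, train_frac = 0.8, seed = 1) {
  set.seed(seed)
  # sample row indices for the training set; the rest go to validation
  idx <- sample(nrow(df), size = floor(train_frac * nrow(df)))
  list(trainset = df[idx, ], testset = df[-idx, ])
}

s <- splitdf_sketch(mtcars)
c(nrow(s$trainset), nrow(s$testset))  # 25 and 7 of mtcars' 32 rows
```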

Baseline

Let’s first try a baseline including all the features, to evaluate the impact of the feature engineering.

lm.model(training, validation, "Baseline")
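lm.model and mape are helpers defined outside this report; a hedged sketch of their likely shape (fit a linear model on all features, predict on the validation set, report the mean absolute percentage error), demonstrated on mtcars since the house data is not reproduced here:

```r
# MAPE in percent; the denominator is the actual value
mape_sketch <- function(pred, actual) 100 * mean(abs((actual - pred) / actual))

lm_model_sketch <- function(train, valid, name, target = "mpg") {
  fit  <- lm(reformulate(".", response = target), data = train)
  pred <- predict(fit, newdata = valid)
  message(name, " MAPE: ", round(mape_sketch(pred, valid[[target]]), 4))
  invisible(pred)
}

pred <- lm_model_sketch(mtcars[1:24, ], mtcars[25:32, ], "Baseline")
```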

Information Gain Selection

Let’s use information gain to select the most important variables. In this section I also use a cross-validation procedure to select the best threshold for my model. The results turn out better than the baseline, so from now on I will use ig_training for all the models.

# applying information gain with the optimal threshold found
weights<- data.frame(information.gain(formula, training))
weights$feature <- rownames(weights)
weights[order(weights$attr_importance, decreasing = TRUE),]
##               attr_importance       feature
## zipcode           0.407883630       zipcode
## grade_agg         0.288890230     grade_agg
## lat               0.277114181           lat
## sqft_living       0.273301970   sqft_living
## TotalSqFeet       0.250527229   TotalSqFeet
## sqft_living15     0.218212778 sqft_living15
## sqft_above        0.195533477    sqft_above
## NeighBin          0.163604183      NeighBin
## distance          0.152663162      distance
## bathrooms         0.143773960     bathrooms
## rooms             0.128156052         rooms
## yr_built          0.066525710      yr_built
## sqft_lot15        0.061870353    sqft_lot15
## bedrooms          0.059315821      bedrooms
## floors            0.057253847        floors
## long              0.052752360          long
## view              0.048681269          view
## sqft_lot          0.048579780      sqft_lot
## sqft_basement     0.042102735 sqft_basement
## yr_renovated      0.018447617  yr_renovated
## condition         0.008416930     condition
## waterfront        0.007289474    waterfront
## Remod             0.005333362         Remod
information_gain_features <- weights$feature[weights$attr_importance > information_gain_CV(training, validation,
                                                                                           int_min = 0.003, 
                                                                                           int_max = 0.009, 
                                                                                           steps = 0.001)]

ig_training <- training[,c(information_gain_features,"price"),with=FALSE]
ig_test <- test[,c(information_gain_features,"price"),with=FALSE]
ig_validation <- validation[,c(information_gain_features,"price"),with=FALSE]

lm.model(ig_training, ig_validation, "Information Gain")
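The information.gain call above comes from the FSelector package. As a reference for what it computes, here is a base-R sketch of information gain with a simple quartile discretization (FSelector's own discretization differs), demonstrated on iris:

```r
entropy <- function(p) { p <- p[p > 0]; -sum(p * log2(p)) }

# IG(x; y) = H(y) - sum_b P(b) * H(y | bin b), after binning numeric x
info_gain <- function(x, y) {
  b   <- cut(x, breaks = unique(quantile(x)), include.lowest = TRUE)
  hy  <- entropy(prop.table(table(y)))
  hyx <- sum(prop.table(table(b)) *
             apply(prop.table(table(b, y), 1), 1, entropy))
  hy - hyx
}

ig <- sapply(iris[1:4], info_gain, y = iris$Species)
sort(ig, decreasing = TRUE)  # petal features carry the most information
```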

Ridge Regression

Let’s try the Ridge Regression

lambdas <- 10^seq(-3, 0, by = .05)

set.seed(121)
train_control_config <- trainControl(method = "repeatedcv", 
                                     number = 5, 
                                     repeats = 1,
                                     returnResamp = "all")

ridge.mod <- train(formula, data = ig_training, 
               method = "glmnet", 
               metric = "RMSE",
               trControl=train_control_config,
               tuneGrid = expand.grid(alpha = 0, lambda = lambdas))

Plotting the RMSE for the different lambda values, we can see the impact of this parameter in the model performance. Small values seem to work better for this dataset.

plot(ridge.mod)

Plotting the coefficients for different lambda values: as expected, the larger the lambda (the lower the norm), the smaller the coefficients of the features. However, as the axis at the top of the plot shows, there is no feature selection; i.e., the model always considers all 277 parameters.

plot(ridge.mod$finalModel)

# evaluating the model 
ridge.mod.pred <- predict(ridge.mod, ig_validation)
ridge.mod.pred[is.na(ridge.mod.pred)] <- 0

my_data <- as.data.frame(cbind(predicted=(exp(ridge.mod.pred) -1), observed=(exp(validation$price) -1)))
ridge.mod.mape <- mape(ridge.mod.pred, validation$price)
ridge.mod.price_error <- mean(abs((exp(ridge.mod.pred) -1) - (exp(validation$price) -1)))

# plotting the results
ggplot(my_data, aes(predicted, observed)) +
    geom_point() + geom_smooth(method = "glm") +
    labs(x="Predicted") +
    ggtitle(paste("Ridge", ' MAPE: ', format(round(ridge.mod.mape, 4), nsmall=4), ' --> Price ERROR:', format(round(ridge.mod.price_error, 0), nsmall=0),
                  '€', sep='')) +
    scale_x_continuous(labels = scales::comma) + 
    scale_y_continuous(labels = scales::comma)

Rank the variables according to the importance attributed by the model.

# Print, plot variable importance
plot(varImp(ridge.mod), top = 20) # 20 most important features

Lasso Regression

Now let’s try lasso in combination with information gain. Moreover, I will use cross-validation to identify the best seed value.

# applying the lasso regression with the optimal seed value found above.
lambdas <- 10^seq(-3, 0, by = .05)

set.seed(lasso_CV(training, validation,20,60,10))
train_control_config <- trainControl(method = "repeatedcv", 
                                     number = 5, 
                                     repeats = 1,
                                     returnResamp = "all")

lasso.mod <- train(formula, data = ig_training, 
               method = "glmnet", 
               metric = "RMSE",
               trControl=train_control_config,
               tuneGrid = expand.grid(alpha = 1, lambda = lambdas))

Plotting the RMSE for the different lambda values, we can see the impact of this parameter in the model performance. Small values seem to work better for this dataset.

plot(lasso.mod)

Plotting the coefficients for different lambda values.

plot(lasso.mod$finalModel)

# evaluating the model
lasso.mod.pred <- predict(lasso.mod, ig_validation)
lasso.mod.pred[is.na(lasso.mod.pred)] <- 0

my_data <- as.data.frame(cbind(predicted=(exp(lasso.mod.pred) -1), observed=(exp(validation$price) -1)))
lasso.mod.mape <- mape(lasso.mod.pred, validation$price)
lasso.mod.price_error <- mean(abs((exp(lasso.mod.pred) -1) - (exp(validation$price) -1)))

# plotting the results
ggplot(my_data, aes(predicted, observed)) +
    geom_point() + geom_smooth(method = "glm") +
    labs(x="Predicted") +
    ggtitle(paste("lasso", ' MAPE: ', format(round(lasso.mod.mape, 4), nsmall=4), ' --> Price ERROR:', format(round(lasso.mod.price_error, 0), nsmall=0),
                  ' €', sep='')) +
    scale_x_continuous(labels = scales::comma) + 
    scale_y_continuous(labels = scales::comma)

Rank the variables according to the importance attributed by the model.

# Print, plot variable importance
plot(varImp(lasso.mod), top = 20) # 20 most important features

Random Forest

Now let’s try random forest. After running all the models with the default parameters, XGB was the best performer, so I will apply hyper-parameter tuning only to XGB.

# defining the model
rf_0<-ranger(formula=formula, data=ig_training)
rf.pred<-predict(rf_0, data = ig_validation)$predictions

# evaluation
rf.mape <- mape(rf.pred, validation$price)
rf.price_error <- mean(abs((exp(rf.pred) -1) - (exp(validation$price) -1)))
my_data <- as.data.frame(cbind(predicted=(exp(rf.pred) -1), observed=(exp(validation$price) -1)))

# plotting the results
ggplot(my_data, aes(predicted, observed)) +
    geom_point() + geom_smooth(method = "glm") +
    labs(x="Predicted") +
    ggtitle(paste("Random Forest", ' MAPE: ', format(round(rf.mape, 4), nsmall=4), ' --> Price ERROR:', format(round(rf.price_error, 0), nsmall=0),
                  ' €', sep='')) +
    scale_x_continuous(labels = scales::comma) + 
    scale_y_continuous(labels = scales::comma)

XGBoost

In this section we’ll use XGBoost. First, we have to transform the data into the format XGBoost expects: we create dummy variables and cast the data into a matrix. Furthermore, I will apply information gain to optimize the model.

xgb_training <- caret::dummyVars(formula= ~., data = training, fullRank=T,sep = "_")
xgb_training <- data.table(predict(xgb_training, newdata = training))
xgb_bench_training <- as.matrix(xgb_training[, !'price', with=F])

xgb_validation <- caret::dummyVars(formula= ~., data = validation, fullRank=T,sep = "_")
xgb_validation <- data.table(predict(xgb_validation, newdata = validation))
xgb_bench_validation <- as.matrix(xgb_validation[, !'price', with=F])

xgb_test <- caret::dummyVars(formula= ~., data = test, fullRank=T,sep = "_")
xgb_test <- data.table(predict(xgb_test, newdata = test))
xgb_bench_test <- as.matrix(xgb_test[, !'price', with=F])

# applying information gain with the optimal threshold found
weights<- data.frame(information.gain(formula, xgb_training))
weights$feature <- rownames(weights)
weights[order(weights$attr_importance, decreasing = TRUE),]
information_gain_features <- weights$feature[weights$attr_importance > information_gain_xgb_CV(xgb_training,
                                                                                           xgb_validation,
                                                                                           xgb_training$price,
                                                                                           xgb_validation$price,
                                                                                           int_min = 0.001, 
                                                                                           int_max = 0.009, 
                                                                                           steps = 0.001)]

xgb_ig_training <- xgb_training[,c(information_gain_features,"price"),with=FALSE]
xgb_ig_training <- as.matrix(xgb_ig_training[, !'price', with=F])

xgb_ig_test <- xgb_test[,c(information_gain_features,"price"),with=FALSE]
xgb_ig_test <- as.matrix(xgb_ig_test[, !'price', with=F])

xgb_ig_validation <- xgb_validation[,c(information_gain_features,"price"),with=FALSE]
xgb_ig_validation <- as.matrix(xgb_ig_validation[, !'price', with=F])
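caret::dummyVars with fullRank = TRUE performs one-hot encoding that drops each factor's reference level, much like base model.matrix; a toy sketch of the idea:

```r
df <- data.frame(price = c(10, 20, 30),
                 cond  = factor(c("good", "bad", "good")))

# full-rank dummies via base R: one column per non-reference factor level
m <- model.matrix(~ ., data = df)[, -1, drop = FALSE]  # drop the intercept
m  # columns: price, condgood (reference level "bad" is implicit)
```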

Now let’s define a first model using all the default parameters.

# defining the model
xgb_0<-xgboost(booster='gbtree',
               data=xgb_bench_training,
               label= training$price,
               nrounds = 100,
               objective='reg:linear')

xgb.pred<-predict(xgb_0, newdata = xgb_bench_validation, type='response')
# evaluation
xgb.mape <- mape(xgb.pred, validation$price)
xgb.price_error <- mean(abs((exp(xgb.pred) -1) - (exp(validation$price) -1)))
my_data <- as.data.frame(cbind(predicted=(exp(xgb.pred) -1), observed=(exp(validation$price) -1)))

# Plotting the results
ggplot(my_data, aes(predicted, observed)) +
    geom_point() + geom_smooth(method = "glm") +
    labs(x="Predicted") +
    ggtitle(paste("XGBoost", ' MAPE: ', format(round(xgb.mape, 4), nsmall=4), ' --> Price ERROR:', format(round(xgb.price_error, 0), nsmall=0),
                  ' €', sep='')) +
    scale_x_continuous(labels = scales::comma) + 
    scale_y_continuous(labels = scales::comma)

XGB Optimized

To improve performance I will use a grid-search function to find the optimal values of the model's hyperparameters.

# finding the best combination of hyperparameters
optmimized_param <- GridSerarch_xgb(xgb_ig_training, training$price)
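GridSerarch_xgb is a helper defined outside this report; a hedged sketch of what such a grid search typically looks like, using xgb.cv and returning a best row in the same Depth/eta/SubSampRate/ColSampRate shape the code below indexes into (the grid values are my assumptions):

```r
library(xgboost)

gridsearch_xgb_sketch <- function(x, y, nrounds = 50) {
  grid <- expand.grid(Depth = c(4, 6), eta = c(0.05, 0.1),
                      SubSampRate = 0.8, ColSampRate = 0.8)
  grid$rmse <- NA_real_
  for (i in seq_len(nrow(grid))) {
    cv <- xgb.cv(params = list(max_depth = grid$Depth[i], eta = grid$eta[i],
                               subsample = grid$SubSampRate[i],
                               colsample_bytree = grid$ColSampRate[i],
                               objective = "reg:squarederror"),  # modern name for 'reg:linear'
                 data = x, label = y, nrounds = nrounds,
                 nfold = 5, verbose = 0)
    grid$rmse[i] <- min(cv$evaluation_log$test_rmse_mean)
  }
  grid[which.min(grid$rmse), , drop = FALSE]  # the best hyperparameter row
}

# toy usage on mtcars, since the house data isn't reproduced here
best <- gridsearch_xgb_sketch(as.matrix(mtcars[, -1]), mtcars$mpg, nrounds = 10)
```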

Now we will define the model with the best hyperparameters.

# Defining the model 
xgb_opt<-xgboost(booster='gbtree',
               data=xgb_ig_training,
               label= training$price,
               nrounds = 100,
               max.depth = optmimized_param[,'Depth'],
               eta = optmimized_param[,'eta'],
               subsample = optmimized_param[,'SubSampRate'],
               colsample_bytree = optmimized_param[,'ColSampRate'],
               min_child_weight = 1,
               objective='reg:linear')

xgb.pred.opt<-predict(xgb_opt, newdata = xgb_ig_validation, type='response')
# evaluation
xgb.mape <- mape(xgb.pred.opt, validation$price)
xgb.price_error <- mean(abs((exp(xgb.pred.opt) -1) - (exp(validation$price) -1)))
my_data <- as.data.frame(cbind(predicted=(exp(xgb.pred.opt) -1), observed=(exp(validation$price) -1)))

# plotting the results
ggplot(my_data, aes(predicted, observed)) +
    geom_point() + geom_smooth(method = "glm") +
    labs(x="Predicted") +
    ggtitle(paste("XGBoost", ' MAPE: ', format(round(xgb.mape, 4), nsmall=4), ' --> Price ERROR:', format(round(xgb.price_error, 0), nsmall=0),
                  ' €', sep='')) +
    scale_x_continuous(labels = scales::comma) + 
    scale_y_continuous(labels = scales::comma)

Regression with StepWise feature selection

In this section we’ll apply regression with stepwise feature selection.

# adapting the dataset to work for the algorithm
lm_ig_training <- ig_training
lm_ig_training$yr_built <- as.numeric(lm_ig_training$yr_built)
lm_ig_training$yr_renovated <- as.numeric(lm_ig_training$yr_renovated)

lm_ig_validation <- ig_validation
lm_ig_validation$yr_built <- as.numeric(lm_ig_validation$yr_built)
lm_ig_validation$yr_renovated <- as.numeric(lm_ig_validation$yr_renovated)

#### Regression with StepWise feature selection 
lm_0<-stepAIC(lm(formula = formula, 
                 data=lm_ig_training),
              trace=F)

lm.pred<-predict(lm_0, newdata = lm_ig_validation)

# evaluation
lm.mape <- mape(lm.pred, validation$price)
lm.price_error <- mean(abs((exp(lm.pred) -1) - (exp(validation$price) -1)))
my_data <- as.data.frame(cbind(predicted=(exp(lm.pred) -1), observed=(exp(validation$price) -1)))

# plotting the results
ggplot(my_data, aes(predicted, observed)) +
    geom_point() + geom_smooth(method = "glm") +
    labs(x="Predicted") +
    ggtitle(paste("StepWise", ' MAPE: ', format(round(lm.mape, 4), nsmall=4), ' --> Price ERROR:', format(round(lm.price_error, 0), nsmall=0),
                  ' €', sep='')) +
    scale_x_continuous(labels = scales::comma) + 
    scale_y_continuous(labels = scales::comma)

Regression with regularization

In this section we’ll apply regression with regularization: a lasso fit via glmnet, with the penalty chosen by cross-validation.

# Defining the model
glmnet_cv<-cv.glmnet(x = xgb_ig_training,
                     nfolds = 5,
                     y = training[['price']],
                     alpha=1,
                     family = 'gaussian',
                     standardize = T)
plot(glmnet_cv)

glmnet_cv$lambda.min
## [1] 0.0003057683
glmnet_0<-glmnet(x = xgb_ig_training, 
                 y = training[['price']],
                 family = 'gaussian',
                 alpha=1, lambda = glmnet_cv$lambda.min)
glmnet.pred <- predict(glmnet_0, newx = xgb_ig_validation)

# Evaluation
glmnet.mape <- mape(glmnet.pred, validation$price)
glmnetprice_error <- mean(abs((exp(glmnet.pred) -1) - (exp(validation$price) -1)))
my_data <- as.data.frame(cbind(predicted=(exp(glmnet.pred) -1), observed=(exp(validation$price) -1)))

# Plotting the results
ggplot(my_data, aes(s0, observed)) +
    geom_point() + geom_smooth(method = "glm") +
    labs(x="Predicted") +
    ggtitle(paste("Regression with regularization", ' MAPE: ', format(round(glmnet.mape, 4), nsmall=4), ' --> Price ERROR:', format(round(glmnetprice_error, 0), nsmall=0),
                  ' €', sep='')) +
    scale_x_continuous(labels = scales::comma) + 
    scale_y_continuous(labels = scales::comma)

Final Submission

We split the original training data into train and validation sets to evaluate the candidate models. To generate the final submission, we instead use all the data at our disposal.

In addition, we also applied a log transformation to the target variable; to revert this transformation we use the exp function.
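The round trip between the training-time transform and the submission-time inverse is simply (431877 is the mean training price from the summary above):

```r
p     <- 431877          # a raw price
log_p <- log(p + 1)      # transform applied during feature engineering
back  <- exp(log_p) - 1  # inverse applied before submitting predictions
all.equal(back, p)       # TRUE up to floating-point tolerance
```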

For my final prediction I tried all the combinations of the models explained above. The best model is the optimized XGBoost.

# Using the best model found above
final.model <- xgb_opt

# Predict the prices for the test data (i.e., we use the exp function to revert the log transformation that we applied to the target variable)
final.pred <- as.numeric(exp(predict(final.model, xgb_ig_test, type='response'))-1)
final.pred[is.na(final.pred)]
## numeric(0)
hist(final.pred, main="Histogram of Predictions", xlab = "Predictions")

xgb_submission <- data.frame(Id = original_test_data$id, price= (final.pred))
colnames(xgb_submission) <-c("Id", "price")
write.csv(xgb_submission, file = "submission1.csv", row.names = FALSE)